You are viewing the RapidMiner Studio documentation for version 10.1
Generate Extract (Text Processing)
Synopsis
Extracts values from structured and unstructured sources using XPath expressions, regular expressions, simple string matching or JSONPath expressions.
Description
This operator extracts additional attributes from structured and unstructured text using regular expressions, XPath, JSONPath or simple string matching. The input texts are taken from the specified source attribute, which must be nominal. The query type can be XPath for XML documents, JSONPath for JSON documents, or regular expressions for less structured texts.
An XPath expression directly specifies which part of the XML document is retrieved, and that part becomes the value of the new attribute. With regular expressions, the first matching group is used as the value. For example, the expression "Name:\s*(.*)\n" applied to the text "Name: Paul" followed by a line break yields "Paul" as the value of the new attribute.
String matching is a fast and easy-to-use alternative to regular expressions, but it is less powerful. You only specify a start string and an end string, and everything between the two is extracted. For example, with the start string "Name:" and a line break as the end string, the text above yields " Paul" (including the leading space).
A result might contain a separated list of values, for example an XML tag whose content is en,de,fr,sp. In that case you can enter the query yielding "en,de,fr,sp" multiple times, using different attribute names. If the separator parameter is set to ",", the first attribute is filled with "en", the second with "de", and so on. This can also be used to get only the first enumerated value. Be careful with this feature: other results that contain the separator are split as well, even if you do not enter a query twice. You can avoid this by inserting a second operator in which you do not specify a separator.
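The three behaviours described above (first matching group, start/end string matching, and splitting a result on a separator) can be sketched in plain Python. This is only an illustration of the described semantics, not the operator's implementation: the helper names `regex_extract`, `string_match_extract` and `split_result`, the attribute names `lang1` and `lang2`, and the choice to return `None` when nothing matches are all assumptions.

```python
import re

text = "Name: Paul\nAge: 42\n"


def regex_extract(text, pattern):
    """Return the first capturing group of the first match, or None (assumed)."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def string_match_extract(text, start, end):
    """Return everything between the first `start` and the next `end`."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


def split_result(result, separator, names):
    """Assign the separated values to the attribute names, in order."""
    return dict(zip(names, result.split(separator)))


print(regex_extract(text, r"Name:\s*(.*)\n"))            # Paul
print(repr(string_match_extract(text, "Name:", "\n")))   # ' Paul'
print(split_result("en,de,fr,sp", ",", ["lang1", "lang2"]))
```

Note that the string-matching result keeps the leading space, because nothing strips whitespace between the start and end strings, whereas the `\s*` in the regular expression consumes it.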
Input
- example set (Data Table)
Output
- example set (Data Table)
Parameters
- source attribute The content of this nominal attribute is used for extracting information. Range: string
- query type Specifies the type of the query. Range: selection
- string matching queries Specifies a list of string matching start and end sequences. Everything between them is used as the result. See the operator documentation for details on string matching. Range: list
- attribute type Specifies the type of the resulting attributes. If numerical or binomial is chosen, ensure that the returned result can be interpreted as that type. Range: selection
- regular expression queries Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions. Range: list
- regular region queries Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions might be specified in order to define the start and the end of a region. Everything in between the two matches will be delivered as result. Range: list
- xpath queries Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
- namespaces Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h. Range: list
- ignore CDATA Indicates if CDATA should be ignored when using the XPath expression. Range: boolean
- assume html If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. For plain XML, uncheck this. Range: boolean
- index queries Specifies a list of attribute names and the regions. Regions are specified as offset index and length of the match. Range: list
- jsonpath queries Specifies a list of attribute names and their corresponding JSONPath queries. Range: list
- value separator The character separating two values of the same query result. See the operator description for more details. Range: string
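As a rough illustration of three of the query kinds listed above (regular region queries, index queries, and XPath queries with a namespace identifier), here is a minimal Python sketch. It is not the operator's code: the helper names are hypothetical, the zero-based offset in `index_extract` is an assumption (the documentation does not say whether offsets start at 0 or 1), and Python's `xml.etree.ElementTree` supports only a subset of XPath. The identifier `h` is bound explicitly here, whereas the operator binds it automatically for (X)HTML.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def region_extract(text, start_pattern, end_pattern):
    """Return the text between a match of start_pattern and the next match of end_pattern."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    # Search for the end pattern only after the start match.
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]


def index_extract(text, offset, length):
    """Return `length` characters starting at `offset` (zero-based, an assumption)."""
    return text[offset:offset + length]


def xpath_extract(xml_text, query, namespaces):
    """Return the text of the first element matching the query, or None."""
    node = ET.fromstring(xml_text).find(query, namespaces)
    return node.text if node is not None else None


doc = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>'

print(region_extract("Price: 12 EUR", r"Price:\s*", r"\s*EUR"))  # 12
print(index_extract("2024-05-17T10:00", 0, 4))                   # 2024
print(xpath_extract(doc, ".//h:p", {"h": XHTML_NS}))             # Hello
```

Without the namespace mapping, the query `.//p` would not match the `p` element in this document, because the element lives in the XHTML namespace.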